Abstract

Motivation: In mass spectrometry-based shotgun proteomics, protein quantification and protein
identification are two major computational problems. To quantify protein abundance, a list of
proteins must first be inferred from the sample. The relative or absolute protein abundance is
then estimated with quantification methods such as spectral counting. Until now, researchers have
dealt with these two processes separately. In fact, they are two sides of the same coin, in the
sense that the truly present proteins are those with non-zero abundances. An interesting question
is therefore: if we regard the protein inference problem as a special protein quantification
problem, is it possible to achieve better protein inference performance?

Contribution: In this paper, we investigate the feasibility of using protein quantification
methods to solve the protein inference problem. Protein inference determines whether each
candidate protein is present in the sample. Protein quantification calculates the abundance of
each protein. Naturally, absent proteins should have zero abundance. Thus, we argue that the
protein inference problem can be viewed as a special case of the protein quantification problem:
present proteins are those with non-zero abundances. Based on this idea, we use three very simple
protein quantification methods to solve the protein inference problem.

Results: The experimental results on six datasets show that these three methods are competitive
with previous protein inference algorithms. This demonstrates that it is plausible to treat
the protein inference problem as a special case of protein quantification, which opens the door
to devising more effective protein inference algorithms from a quantification perspective.



Availability: The source code of our methods is available at: http://code.google.com/p/protein-inference/



Key words: Shotgun proteomics, protein inference, protein quantification, spectral counting, 
linear programming. 
1 Introduction



Mass spectrometry (MS)-based shotgun proteomics is currently the most widely used method for
the identification and quantification of proteins. As shown in Fig. 1, it first digests the sample
into a mixture of peptides by enzymes such as trypsin. The resulting peptide mixtures are scanned
by tandem mass spectrometry (MS/MS) to generate a set of MS/MS spectra. Then the peptide 
search engine reports a set of peptide-spectrum matches (PSMs) by searching the MS/MS spectra 
against a protein database. From these peptide identifications, we infer the existence of proteins 
with protein inference algorithms and calculate the relative or absolute abundances of proteins 
with protein quantification approaches.

Until recently, the identification and quantification of proteins were tackled as two separate,
sequential tasks: first select a subset of proteins that are truly present and then determine
the abundances of these proteins. For both problems, many elegant approaches (3433] have been 
developed in the past decades. The readers can refer to two recent reviews [1, 0] for details. 

The starting point of this paper is the observation of some key underlying connections between 
these two problems. In protein inference, the objective is to generate a binary presence indicator 
value (1 or 0) for each candidate protein. In this regard, "protein existence inference" is probably 
more accurate in describing the original protein inference task. In protein quantification or protein 
abundance inference, the objective is to determine the abundances of a set of proteins. Clearly, 
if one protein is not present, its abundance should be 0. Hence, we argue that the protein 
inference problem can be investigated from the perspective of protein quantification: present 
proteins are those proteins with non-zero abundances. In other words, we can adopt available 
protein quantification methods directly to solve the protein inference problem. This new angle 
may enable a better understanding of the protein inference problem and help in devising improved 
or hybrid methods by combining elements from two areas that would otherwise be considered 
incompatible. 

As a proof of concept, we investigate the feasibility of solving the protein inference problem with
existing protein quantification methods in the context of label-free proteomics. In label-free
quantitative proteomics studies, quantification methods based on peak ion intensities (from MS
data) and spectral counting (from MS/MS data) have been widely used.

Spectral counting measures the abundance of each protein based on the number of MS/MS 
spectra that match its constituent peptides. Compared to peptide intensity values, spectral count- 
ing information is easier to obtain since we just need to count the number of the MS/MS spectra. 
In this paper, we use spectral counting as the quantification approach for solving the protein 
inference problem. 

We first try two simple spectral counting methods from the literature. In both methods, the
protein abundance is calculated as the sum of peptide abundances. They differ in how they
handle shared peptides. If the abundance of a shared peptide is b and it has k parent
proteins, then b is used as its abundance in the first method while b/k is used in the second
method. Both methods assume that all the candidate proteins are present in the sample and
have non-zero abundances. However, this assumption contradicts the objective of protein
inference, which is to distinguish present proteins with non-zero abundances from absent proteins
with zero abundances. Thus, we propose a third method, a linear programming model that shrinks
some protein abundances to zero.

To our knowledge, our paper is the first attempt to use protein quantification methods for
protein inference. Such an attempt connects two important computational problems that have
long been investigated separately. The experimental results show that we can obtain better
performance on most datasets even when the simplest version of spectral counting is used.
Hence, advances in protein quantification studies should promote the development of more
effective protein inference algorithms.

In Section 2, we describe the details of three methods. Section 3 shows the experimental results 
on six datasets. Section 4 concludes the paper. 
Figure 1: Protein identification and quantification using mass spectrometry in shotgun proteomics.
There are three major computational problems: peptide identification, protein inference and
protein quantification.



2 Methods 

As shown on the left side of Fig. 2, the input of the protein inference problem can be represented
as a tripartite graph $G = (X \cup Y \cup Z, E_1 \cup E_2)$, where $X$, $Y$ and $Z$ are the sets of
$l$ MS/MS experimental spectra, $m$ identified peptides and $n$ candidate proteins, respectively.
For all $x_i \in X$, $y_j \in Y$, there is an edge $(x_i, y_j) \in E_1$ if and only if spectrum
$x_i$ matches peptide $y_j$ in the peptide identification results. Similarly, $(y_j, z_k) \in E_2$
means that peptide $y_j$ is part of the sequence of protein $z_k$. Each MS/MS spectrum corresponds
to one and only one identified peptide, whereas some peptides may have more than one matching
spectrum, such as peptides $y_2$ and $y_3$ in Fig. 2. The relationship between peptides and
proteins is more complex: a candidate protein may have several identified peptides, and a peptide
can be shared by multiple proteins. How to correctly distribute these shared peptides is one of
the most challenging problems in protein inference.

We first formulate the protein inference problem as a special case of the protein quantification
problem. The objective of protein inference is to determine whether each candidate protein is
present in the sample. The aim of protein quantification is to estimate the abundances of a set 
of proteins. Clearly, if one protein is not present, its abundance should be 0. In this paper, 
the protein inference problem is re-visited from the perspective of protein quantification through 
seeking those proteins with non-zero abundances. 

To obtain the protein abundance, we start by calculating the peptide abundance. The
abundance $b_j$ of peptide $y_j$ is calculated as the sum of PSM probabilities (or scores):

$$b_j = \sum_{\{i \mid (x_i, y_j) \in E_1\}} a_{ij}, \qquad (1)$$

where $a_{ij}$ is the probability that spectrum $x_i$ matches peptide $y_j$. $a_{ij}$ can also be
viewed as the weight of edge $(x_i, y_j) \in E_1$, which can be obtained from peptide
identification algorithms such as Mascot [33] or post-processing tools such as PeptideProphet [38].
In traditional spectral counting methods, the peptide abundance is simply the number of MS/MS
spectra identified for each peptide. Here, we generalize this spectral counting method to account
for the quality of PSMs. More precisely, the contribution of each spectrum to the peptide
abundance is a quantitative value between 0 and 1 rather than a fixed value of one. Such an
extension is important for protein inference since it may help us to distinguish proteins with
the same number of PSMs.
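
As a minimal sketch of Equation (1), the following Python snippet sums PSM probabilities per peptide. The function name `peptide_abundances`, the tuple layout of the input and the toy values are assumptions made for illustration only; they are not taken from the released source code.

```python
from collections import defaultdict

def peptide_abundances(psms):
    """Sum PSM probabilities per peptide (Equation 1).

    psms: iterable of (spectrum_id, peptide_id, probability) tuples,
    one tuple per spectrum, since each spectrum matches exactly one peptide.
    """
    b = defaultdict(float)
    for _spectrum, peptide, prob in psms:
        b[peptide] += prob
    return dict(b)

# Toy input: peptide y2 is matched by two spectra of different quality.
psms = [("x1", "y1", 0.95), ("x2", "y2", 0.80), ("x3", "y2", 0.60), ("x4", "y3", 0.40)]
b = peptide_abundances(psms)
```

With these toy values, traditional spectral counting would assign y2 an abundance of 2, whereas the generalized count is the sum of its two PSM probabilities.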

To calculate the protein abundance, we need to distribute the abundance of each peptide to its 
parent proteins. The main difficulty is how to deal with the degenerate peptide that is shared by 
more than one protein since such a peptide can be generated by any subset of its parent proteins. 

There are several approaches to solving the shared peptide problem in protein quantification,
as shown on the right side of Fig. 2. The first approach is to simply discard the shared
peptides and use only the unique peptides to calculate the protein abundance. This approach
has one disadvantage: it causes information loss, especially for proteins whose identified
peptides are all shared. In Fig. 2, if we delete the shared peptide $y_2$, then its two parent
proteins no longer have any identified peptide and would be considered absent from the
sample. In fact, at least one of these two proteins must be present if we assume the existence of
peptide $y_2$. Alternatively, we can use both unique and shared peptides to estimate the protein
abundance. In the second approach, the abundance of each shared peptide is used in the
abundance calculation of all its parent proteins. In other words, each peptide is counted multiple
times, so that the abundances of some proteins may be over-estimated. We call this method
"multiple counting" in this paper. For example, peptide $y_2$ in Fig. 2 is counted twice in the
second approach, which means that we artificially increase the abundance of peptide $y_2$ from
$b_2$ to $2 b_2$. The third approach divides the abundance of a shared peptide into parts and
distributes each part to one of its parent proteins. This approach ensures that each peptide is
"counted" only once. One typical representative in this category is the "equal division" method,
which partitions the peptide abundance into $k$ equal parts ($k$ is the number of proteins that
share this peptide).

Since multiple counting and equal division are the most popular and simplest approaches for
spectral counting based protein quantification, we first try these two methods for protein
inference and see how they perform. Both methods assume that all the candidate proteins are
present in the sample and have non-zero abundances. However, this assumption does not
hold in protein inference, because absent proteins should have zero abundances. Thus, we also
propose a linear programming model that distributes peptide abundance automatically and
sets the abundances of some proteins to zero.

2.1 Multiple Counting 

In this method, shared peptides are used in the same way as unique peptides and receive no
special treatment. The protein abundance is simply the sum of the abundances of both the
shared and the unique peptides corresponding to each protein:

$$c_k = \sum_{\{j \mid (y_j, z_k) \in E_2\}} b_j, \qquad (2)$$

where $c_k$ is the abundance of protein $z_k$. If peptide $y_j$ has $q_j$ parent proteins, then it
is counted $q_j$ times and its actual abundance used in the calculation is $q_j b_j$.
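
Equation (2) can be sketched in a few lines of Python. The function name `multiple_counting`, the dictionary-based representation of $E_2$ (`parents` maps each peptide to its parent proteins) and the toy numbers are illustrative assumptions.

```python
def multiple_counting(b, parents):
    """Protein abundance as the plain sum of its peptides' abundances (Equation 2).

    b: dict peptide -> abundance b_j
    parents: dict peptide -> list of parent proteins (the edges in E_2)
    """
    c = {}
    for peptide, proteins in parents.items():
        for protein in proteins:
            # A shared peptide contributes its full abundance to every parent.
            c[protein] = c.get(protein, 0.0) + b[peptide]
    return c

b = {"y1": 1.0, "y2": 2.0}
parents = {"y1": ["z1"], "y2": ["z1", "z2"]}  # y2 is shared by z1 and z2
c = multiple_counting(b, parents)
```

In this toy case the protein abundances add up to 5.0 although the peptide abundances add up to 3.0, which is the over-counting of the shared peptide described above.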

2.2 Equal Division 

Unlike the above method, which counts shared peptides multiple times, this method counts
each peptide only once. It distributes the abundance of each shared peptide equally among its
parent proteins:

$$c_k = \sum_{\{j \mid (y_j, z_k) \in E_2\}} \frac{b_j}{q_j}, \qquad (3)$$

where $q_j$ is the number of candidate proteins sharing peptide $y_j$. If peptide $y_j$ is a unique
peptide, then $q_j = 1$.
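
A minimal Python sketch of Equation (3) follows. As before, the function name `equal_division`, the dictionary layout and the toy numbers are assumptions chosen for illustration.

```python
def equal_division(b, parents):
    """Split each peptide's abundance equally among its parent proteins (Equation 3).

    b: dict peptide -> abundance b_j
    parents: dict peptide -> list of parent proteins (the edges in E_2)
    """
    c = {}
    for peptide, proteins in parents.items():
        share = b[peptide] / len(proteins)  # b_j / q_j
        for protein in proteins:
            c[protein] = c.get(protein, 0.0) + share
    return c

b = {"y1": 1.0, "y2": 2.0}
parents = {"y1": ["z1"], "y2": ["z1", "z2"]}  # y2 is shared by z1 and z2
c = equal_division(b, parents)
```

Here the protein abundances add up to the total peptide abundance of 3.0, so each peptide is counted exactly once.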
Figure 2: Three approaches used in spectral counting for solving the shared peptide problem.
$y_1$ and $y_3$ are unique peptides while $y_2$, $y_4$ and $y_5$ are shared peptides. The abundance
of peptide $y_j$ is represented by $b_j$. We use peptide $y_2$ as an example to explain how these
three approaches work.

2.3 Linear Programming Model 

For each identified peptide $y_j$, the peptide abundance can be computed as:

$$b_j = \sum_{\{k \mid (y_j, z_k) \in E_2\}} d_{jk}, \qquad (4)$$

where $d_{jk}$ can be interpreted as the abundance that protein $z_k$ contributes to peptide $y_j$.
The variable $d_{jk}$ serves as the bridge between peptide abundance and protein abundance. On
one hand, we can use $d_{jk}$ to explain the known peptide abundance. On the other hand, we can
calculate the unknown protein abundance through $d_{jk}$. Therefore, the quantification-based
protein inference problem is equivalent to finding an optimal matrix $D = (d_{jk})$.

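According to the above analysis, we propose a linear programming (LP) model to solve the
protein inference problem:

$$\min \sum_{k=1}^{n} t_k \qquad (5)$$

subject to

$$\forall j, k: \; d_{jk} \le t_k, \qquad (6)$$

$$\forall j: \; b_j - \sum_{\{k \mid (y_j, z_k) \in E_2\}} d_{jk} = 0, \qquad (7)$$

$$\forall j, k: \; d_{jk} \begin{cases} = 0 & \text{if } (y_j, z_k) \notin E_2, \\ \ge 0 & \text{otherwise.} \end{cases} \qquad (8)$$

Some further illustrations on the model are listed as follows: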



• Constraint (6) makes $t_k$ an upper bound on every entry of the column vector $d_k$ (the $k$th
column of matrix $D$), so that at the optimum $t_k$ is the maximum value in that column. Since we
regard the proteins with non-zero abundances as being present in the sample, the abundances
of absent proteins should be zero. Therefore, we minimize the sum of the maximum peptide
abundance from each protein in the objective function so as to shrink some protein
abundances to 0.

• The left-hand side of constraint (7) is the difference between the observed and predicted
peptide abundance: $b_j$ is viewed as the observed value and the sum of $d_{jk}$ is the predicted
value. The zero difference means that each peptide is "counted" only once when the abundance
of a shared peptide is distributed.

• In constraint (8), we set $d_{jk} = 0$ if $(y_j, z_k) \notin E_2$ and consider only the remaining
elements of matrix $D$ as variables. This greatly improves the running efficiency of the LP model.

• Dost et al. have presented a similar LP model in their F2 formulation. It aims at inferring
the protein abundance and peptide detectability simultaneously. The biggest difference
between the two LP models is that our model sets some protein abundances to 0 while
Dost's method does not.

After obtaining the matrix $D$, the protein abundance $c_k$ is calculated as:

$$c_k = \sum_{\{j \mid (y_j, z_k) \in E_2\}} d_{jk}. \qquad (9)$$
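
The LP model in Equations (5)-(9) can be sketched with an off-the-shelf solver. The snippet below is an illustrative assumption rather than the released implementation: it uses `scipy.optimize.linprog`, creates one variable $d_{jk}$ per edge in $E_2$ (as suggested by constraint (8)) plus one variable $t_k$ per protein, and the function name `lp_protein_abundance` and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def lp_protein_abundance(b, parents):
    """Solve the LP of Equations (5)-(8) and return c_k from Equation (9).

    b: dict peptide -> abundance b_j
    parents: dict peptide -> list of parent proteins (the edges in E_2)
    """
    peptides = sorted(parents)
    proteins = sorted({z for zs in parents.values() for z in zs})
    edges = [(y, z) for y in peptides for z in parents[y]]  # one d_jk per edge
    n_d, n_t = len(edges), len(proteins)
    t_index = {z: n_d + i for i, z in enumerate(proteins)}

    cost = np.zeros(n_d + n_t)
    cost[n_d:] = 1.0                      # objective (5): minimise sum of t_k

    A_ub = np.zeros((n_d, n_d + n_t))     # constraint (6): d_jk - t_k <= 0
    A_eq = np.zeros((len(peptides), n_d + n_t))  # constraint (7): sum_k d_jk = b_j
    for e, (y, z) in enumerate(edges):
        A_ub[e, e] = 1.0
        A_ub[e, t_index[z]] = -1.0
        A_eq[peptides.index(y), e] = 1.0
    b_eq = np.array([b[y] for y in peptides])

    # bounds=(0, None) enforces the non-negativity part of constraint (8).
    res = linprog(cost, A_ub=A_ub, b_ub=np.zeros(n_d), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)

    c = {z: 0.0 for z in proteins}        # Equation (9): c_k = sum_j d_jk
    for e, (_y, z) in enumerate(edges):
        c[z] += res.x[e]
    return c

b = {"y1": 2.0, "y2": 1.5}
parents = {"y1": ["z1"], "y2": ["z1", "z2"]}  # y2 is shared by z1 and z2
c = lp_protein_abundance(b, parents)
```

In this toy instance the unique peptide y1 already forces $t_1 \ge 2.0$, so assigning all of the shared peptide y2 to z1 costs nothing extra, and the abundance of z2 is driven to zero.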

2.4 Converting Scores into Probabilities 

After obtaining the protein abundance, it is beneficial to convert the abundance into a
well-calibrated probability. The main reason is that a probability estimate allows us to select an
appropriate threshold for reporting the present proteins. In fact, the problem of converting
ranking scores into estimated probabilities has been widely investigated in different domains. In
this paper, we use the method proposed in [41] to fulfill this task.

We first estimate the probability $p_k$ that protein $z_k$ is present in the sample given its
abundance $c_k$:

$$p_k = \Pr(z_k = 1 \mid c_k) = \frac{\Pr(c_k \mid z_k = 1)\Pr(z_k = 1)}{\Pr(c_k \mid z_k = 1)\Pr(z_k = 1) + \Pr(c_k \mid z_k = 0)\Pr(z_k = 0)} = \frac{1}{1 + \exp(-f_k)}, \qquad (10)$$

where

$$f_k = \log \frac{\Pr(c_k \mid z_k = 1)\Pr(z_k = 1)}{\Pr(c_k \mid z_k = 0)\Pr(z_k = 0)}. \qquad (11)$$

Assuming $f_k$ has a Gaussian distribution with equal covariance matrices, Equation (10)
becomes

$$p_k = \frac{1}{1 + \exp(A c_k + B)}. \qquad (12)$$

Now, we need to learn the parameters $A$ and $B$. Let $r_k$ be a binary variable whose value
is 1 if protein $z_k$ is present in the sample and 0 otherwise. Then, $R = (r_1, r_2, \ldots, r_n)$
is the presence indicator vector of the $n$ candidate proteins. If we assume that the existence of
each protein is independent of the other proteins, the probability of observing $R$ given $C$ is:

$$\Pr(R \mid C) = \prod_{k=1}^{n} p_k^{r_k} (1 - p_k)^{1 - r_k}, \qquad (13)$$

where $C = \{c_1, c_2, \ldots, c_n\}$. The optimal parameter values should maximize
$\Pr(R \mid C)$, i.e., minimize the following negative log likelihood function:

$$LL(R \mid C) = \sum_{k=1}^{n} \left[ (1 - r_k)(-A c_k - B) + \log(1 + \exp(A c_k + B)) \right]. \qquad (14)$$

Equation (14) is based on the assumption that the indicator vector $R$ is already known.
However, this information is not available in the protein inference process. Thus, we treat the
$r_k$ as hidden variables and employ an EM algorithm [41] to simultaneously estimate $A$, $B$ and
$R$.

The EM algorithm uses an iterative procedure to estimate the parameter values
$\theta = \{A, B\}$. The procedure includes two steps: set $r_k^{s+1} = E(r_k^s \mid C, \theta^s)$
(E-step) and compute $\theta^{s+1} = \arg\min_\theta LL(R^{s+1} \mid C)$ (M-step), where $s$ is the
iteration index. During the E-step, the unknown vector $R$ is replaced by its expected value
$R^{s+1}$ under the current estimated parameter values $\theta^s$. Since $\theta^s$ is fixed,
$LL(R \mid C)$ is minimized by setting $r_k^{s+1} = 0$ if $A c_k + B > 0$ or $r_k^{s+1} = 1$ if
$A c_k + B < 0$. During the M-step, a new parameter estimate $\theta^{s+1}$ is computed by
minimizing $LL(R \mid C)$ given the $R^{s+1}$ values calculated in the E-step. Since
$R^{s+1} = [r_k^{s+1}]$ is fixed, minimizing $LL(R \mid C)$ with respect to $A$ and $B$ is a
two-parameter optimization problem, which can be solved using the model-trust algorithm.
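
The iteration can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the experiments: the M-step uses plain gradient descent as a stand-in for the model-trust algorithm, the initial values of $A$ and $B$ are a hypothetical choice, and the function names and toy abundances are invented for the example.

```python
import math

def presence_prob(c, A, B):
    """Equation (12): p = 1 / (1 + exp(A*c + B)), evaluated without overflow."""
    x = A * c + B
    if x >= 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))

def m_step(c, r, A, B, steps=2000, lr=0.1):
    """Decrease Equation (14) over (A, B) by averaged gradient descent.

    Stand-in for the model-trust optimiser mentioned in the text.
    """
    n = len(c)
    for _ in range(steps):
        # d/dx [(1 - r)(-x) + log(1 + exp(x))] = r - p, with x = A*c + B
        resid = [rk - presence_prob(ck, A, B) for ck, rk in zip(c, r)]
        A -= lr * sum(e * ck for e, ck in zip(resid, c)) / n
        B -= lr * sum(resid) / n
    return A, B

def em_calibrate(c, max_iter=20):
    # Hypothetical initialisation: proteins above the mean abundance start as present.
    A, B = -1.0, sum(c) / len(c)
    r = None
    for _ in range(max_iter):
        new_r = [1 if A * ck + B < 0 else 0 for ck in c]  # E-step (hard assignment)
        if new_r == r:
            break                                         # labels stopped changing
        r = new_r
        A, B = m_step(c, r, A, B)                         # M-step
    return A, B, r

abundances = [0.1, 0.2, 0.3, 5.0, 6.0, 7.0]
A, B, r = em_calibrate(abundances)
probs = [presence_prob(ck, A, B) for ck in abundances]
```

Because the E-step makes hard 0/1 assignments, the outcome of such a sketch depends on the initial parameter values; with well-separated toy abundances the labels stabilise after the first iteration.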



3 Experimental Results 

To test the performance of our methods, we compared them with ProteinProphet and
MSBayesPro [13] on six datasets.



3.1 Datasets 

We use six publicly available datasets; their URLs are given in Table 1. Among these six
datasets, 18 mixtures [43], Sigma49 and yeast [25] have a corresponding protein reference set that
serves as the set of ground-truth proteins. An identified protein is labeled as a true
identification if it is present in the protein reference set. The other three datasets, DME [44],
HumanMD [27] and HumanEKC [25], have no such sets. Thus, we use a target-decoy strategy for
performance evaluation, in which the MS/MS spectra are searched against a mixed protein database
containing all target protein sequences and an equal number of decoy sequences. Using this
strategy, an identified protein is considered a true identification if it comes from the target
protein database.

Mixture of 18 Purified Proteins (18 mixtures). The first dataset is a synthetic mixture
of 18 highly purified proteins from the ISB Standard Protein Mix Database [43]. The protein
database consists of 1819 protein sequences, including the 18 standard proteins, with contaminant
entries appended to the database.

Sigma49 Dataset. Sigma49 is a synthetic mixture of 49 human proteins. The database used 
for peptide identification is composed of 15682 Swiss-Prot human protein sequences. 



Yeast Dataset. This dataset has been used in [25]. The reference set is generated by
an intersection of identified proteins from 4 MS-based proteomics datasets and 3 non-MS-based
datasets. It contains 4265 proteins observed in either two or more MS datasets or any of the
non-MS datasets and is available at http://www.marcottelab.org/MSdata/gold/yeast.html. The
database used in the experiment contains 6,714 protein sequences.

D. melanogaster Dataset (DME). DME comes from the embryonal Kc 167 cell line of D.
melanogaster [44]. Its corresponding protein database is release 5.2 from Flybase with 20,726
entries.



HumanMD Dataset. This dataset has been used in [27]. Its sample is a medulloblastoma
Daoy cell line obtained from the American Type Culture Collection (ATCC). The protein database
is Ensembl version 49.36k with 22,997 entries.

HumanEKC Dataset. HumanEKC has been used in [25]. It is generated from a human
embryonic kidney T293 cell line of ATCC. Its database is the same as that of the HumanMD dataset.



3.2 Peptide Identification 

We use X!Tandem (v2010.10.01.1) [45] as the peptide identification software. For the 18 mixtures,
Sigma49 and yeast datasets, all MS/MS spectra are searched against the target protein
databases only. For the DME, HumanMD and HumanEKC datasets, the spectra are searched against
both target and decoy protein databases.
Table 1: Dataset URLs.

Dataset                            Raw data URL
Mixture of 18 Purified Proteins    http://regis-web.systemsbiology.net/PublicDatasets/
Sigma49 Dataset                    https://proteomecommons.org/dataset.jsp?i=71610
Yeast Dataset                      http://www.marcottelab.org/users/MSdata/Data_02/
D. melanogaster Dataset            http://www.peptideatlas.org/repository/ (PAe001349)
HumanMD Dataset                    http://www.marcottelab.org/MSdata/Data_05/
HumanEKC Dataset                   http://www.marcottelab.org/MSdata/Data_07/



During the database search, we use default search parameters. Then, peptide-spectra matching 
probabilities are computed using PeptideProphet included in Trans-Proteomic Pipeline (TPP) 
v4.5. 

3.3 Protein Inference 

We compare our methods with ProteinProphet and MSBayesPro. ProteinProphet is the most 
popular method for protein inference so far. MSBayesPro is one representative of recently proposed 
methods and its software package is publicly available. We run ProteinProphet with its default 
parameter setting. MSBayesPro uses peptide detectability information as one part of its input. 
For some peptides whose detectabilities cannot be predicted by the current software, we calculate 
them by ourselves: the detectability value = median(predicted detectability scores from the same 
parent protein)/3. 

For the proteins that cannot be distinguished with respect to identified peptides, Protein- 
Prophet and our LP model put all of them into the same group. Whenever we refer to the number 
of true positives (TPs) or false positives (FPs) identified at a threshold or use these values in 
a calculation, all proteins in the group are reported and the group probability is used as their 
protein probabilities. 
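
As an illustration of the grouping step, proteins can be treated as indistinguishable when they are linked to exactly the same set of identified peptides. The sketch below assumes this definition; the function name `group_indistinguishable` and the dictionary layout are hypothetical and do not describe how ProteinProphet forms its groups.

```python
from collections import defaultdict

def group_indistinguishable(parents):
    """Group proteins that share exactly the same set of identified peptides.

    parents: dict peptide -> list of parent proteins (the edges in E_2)
    """
    peptides_of = defaultdict(set)
    for peptide, proteins in parents.items():
        for protein in proteins:
            peptides_of[protein].add(peptide)
    groups = defaultdict(list)
    for protein, peps in peptides_of.items():
        # Proteins with identical peptide sets land in the same bucket.
        groups[frozenset(peps)].append(protein)
    return sorted(sorted(g) for g in groups.values())

parents = {"y1": ["z1"], "y2": ["z2", "z3"], "y3": ["z2", "z3"]}
groups = group_indistinguishable(parents)
```

In this toy input, z2 and z3 are supported by the same two peptides and therefore form one group, while z1 stays on its own.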

3.4 Results 

We evaluate the performance of different methods using a curve that plots the number of TPs as a
function of the q-value. An identified protein is labeled as a TP if it is present in the protein
reference set or the target protein sequence database, and as an FP otherwise. Given a certain
probability threshold $t$, suppose there are $T_t$ TPs and $F_t$ FPs; the false discovery rate
(FDR) is then estimated as $FDR_t = F_t / (F_t + T_t)$. The corresponding q-value is defined as the
minimal FDR at which a protein is reported: $q_t = \min_{t' \le t} FDR_{t'}$. The curve is produced
by varying the probability threshold $t$.
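
The FDR and q-value definitions above can be sketched as follows. The function name `q_values`, the input layout and the toy labels are assumptions for illustration; for simplicity the sketch treats every rank as its own threshold and does not merge proteins with tied probabilities.

```python
def q_values(scored):
    """Return q-values for proteins ranked by descending probability.

    scored: list of (probability, is_true_positive) pairs
    """
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    fdrs, tp, fp = [], 0, 0
    for _prob, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        fdrs.append(fp / (tp + fp))   # FDR_t = F_t / (F_t + T_t)
    q, best = [], 1.0
    for fdr in reversed(fdrs):        # lower thresholds t' <= t come later in the ranking
        best = min(best, fdr)
        q.append(best)
    return q[::-1]

scored = [(0.99, True), (0.95, True), (0.90, False),
          (0.80, True), (0.70, True), (0.60, False)]
q = q_values(scored)
```

The third protein is a false positive with an FDR of 1/3 at its own threshold, but its q-value is 0.2 because a more inclusive threshold further down the list reaches a lower FDR.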

Figure 3 plots the number of TPs identified by the five methods at different q-values. It shows
that our methods are competitive with available protein inference algorithms. On all six
datasets, our three methods always achieve zero FPs among the highest-ranking proteins,
while the other two algorithms do not have this property. This indicates that the protein scores
produced by our methods have strong discriminative power. More specifically, we make the following
observations.

First, the multiple counting method performs the best on the DME, HumanMD and HumanEKC
datasets. For DME and HumanMD, it reports the largest number of TPs at a q-value of zero.
For HumanEKC, it identifies only 17 fewer proteins than ProteinProphet when the q-value is 0. Even
though the multiple counting method does not maintain this performance on the 18 mixtures and
Sigma49 datasets, it is not the worst performer there.

Second, equal division is the best performer (or tied with other algorithms) on the 18 mixtures,
Sigma49 and yeast datasets. Similarly, at a q-value of zero, it identifies the most TPs on these
three datasets. On the DME, HumanMD and HumanEKC datasets, equal division is not the worst
performer either: it beats at least one algorithm on each of them.
Figure 3: Identification performance comparison among MSBayesPro (MSB), ProteinProphet
(PP) and our own three methods: multiple counting (MP), equal division (ED) and linear
programming (LP). We only plot the curve up to 0.1 along the x-axis for the yeast, DME and HumanMD
datasets since people are particularly interested in the performance of different algorithms when
the q-value or FDR is very small. For HumanEKC, the maximum q-value is < 0.04, so we
choose 0.03 as the limit of the x-axis. We cannot set the q-value range very small for the 18
mixtures and Sigma49 datasets since the probabilities of top-scoring proteins in several algorithms
are all equal to one; hence we have to skip these proteins with the same probabilities and then
calculate the q-value of the first appearing protein with a different probability.

Figure 4: Identification performance comparison between the generalized spectral counting
methods (MP, ED, LP) and the traditional spectral counting methods (NMP, NED, NLP). The y-axis
is the number of true positives and the x-axis is the corresponding q-value (the minimum FDR to
report these proteins). The abbreviations for different methods are the same as those in Figure 3.



Third, the LP model exhibits the most stable identification performance among these five
methods. More precisely, its performance is at least the third best across all the five datasets.
The other four algorithms do not have this property: each of them performs worse than at least
three algorithms on some datasets. The number of such datasets is 2, 1, 2 and 4 for multiple
counting, equal division, ProteinProphet and MSBayesPro, respectively.

In the calculation of protein abundance, we generalize the number of MS/MS spectra to the
sum of PSM probabilities. We expect this extension to help distinguish proteins with
the same number of PSMs and thereby improve the identification performance. To verify this,
Figure 4 shows the performance gain when the generalized spectral counting is used instead
of the traditional spectral counting. The experimental results of the three methods on the six
datasets agree with our expectation: using the sum of PSM probabilities performs better
than using the number of PSMs.

After obtaining the protein abundance, we use an EM algorithm to convert the abundance score
into a well-calibrated probability. Alternatively, we can simply normalize the abundance by
dividing it by the maximum of all calculated protein abundances. This also gives a protein score
between 0 and 1 and keeps the overall distribution of the original protein abundance unchanged.
Figure 5 shows why we adopt the more complex probability estimation approach. It compares
the distribution of the normalized score and the estimated probability using the protein abundance
calculated by the LP model. For each of the six datasets, the area under the probability estimation
curve is larger than that under the normalized score curve. This indicates that the probability
estimation has a more uniform distribution than the normalized protein score. Furthermore, the
estimated probabilities of top-ranking proteins are very close to one but not equal to 1, which
allows for distinction at a fine level.

Figure 5: Comparison of the score distribution between normalized score (NS) and probability
estimation (PE) when the protein abundance value is generated with the LP model. The scores
of all the identified proteins are sorted in descending order. (Panels: 18 mixtures, Sigma49,
Yeast; x-axis: Index.)

4 Conclusion 

Protein inference and protein quantification have long been considered two separate
computational problems. In this paper, we investigate the feasibility of solving the protein
inference problem with existing protein quantification methods in the context of label-free protein
quantification. The experimental results show that this new angle enables us to obtain better
identification performance even with some very simple quantification approaches available in the
literature.

We have tested three spectral counting methods for solving the protein inference problem.
These three methods achieve good performance, but none of them is consistently the best
method on all the datasets. Thus, it is still necessary to develop better algorithms. In
future work, we plan to try more quantification methods to check whether we can further improve
the identification performance.